Abstract

We study the approximability of instances of the minimum entropy
set cover problem, parameterized by the average frequency of a random
element in the covering sets. We analyze an algorithm combining
a greedy approach with another one biased towards large sets. The
algorithm is controlled by the percentage of elements to which we apply
the biased approach. The optimal parameter choice has a phase
transition around average density e and leads to improved approximation
guarantees when average element frequency is less than e.



1 Introduction 

The minimum entropy set cover problem (MESC) [Halperin and Karp(2005)]
arose from a maximum likelihood approach to haplotype inference in
computational biology (see also [Măndoiu and Paşaniuc(2005)]). Halperin and
Karp showed that the problem is NP-complete and provided an additive upper
bound (equal to three) on the performance of the Greedy algorithm. This
was later improved by Cardinal et al. [Cardinal et al.(2008a)], who showed a
tight additive upper bound of $\log_2 e$. Cardinal et al.
[Cardinal et al.(2012)] also studied several versions of this problem, notably
minimum entropy graph coloring [Cardinal et al.(2004)] and minimum entropy
orientation [Cardinal et al.(2008b)], as well as a generalization to
arbitrary objective functions [Cardinal and Dumeunier(2008)]. Minimum
entropy graph coloring has found applications to problems related to functional
compression in information theory [Cardinal et al.(2004)].

Minimum entropy set cover also lies behind a recently proposed family
of measures of worst-case fairness in cost allocations in cooperative game
theory [Bonchiş and Istrate(2012a)]. This was accomplished by first studying
[Bonchiş and Istrate(2012b)] a minimum entropy version of the well-known
submodular set cover problem [Wolsey(1982), Fujita(2000)]. Submodularity
corresponds in the setting of cooperative game theory to concavity of the
associated game, a property that guarantees many useful features of the game
such as the non-emptiness of the core, membership of the Shapley value in the
core, equivalence between group-strategyproofness and cross-monotonicity in
mechanism design [Moulin(1999)] and so on.

In this paper we further study MESC restricted to sparse instances, that
is to instances of Set Cover parameterized by $f$ (formally defined below), the
average number of sets that cover a random element. In the spirit of the
minimum entropy orientation problem (a version of MESC for which $f = 2$)
we aim to provide better approximation guarantees than those valid for the
Greedy algorithm. To accomplish this goal we study the performance of
an approximation algorithm BiasedGreedy($\delta$) parameterized by a constant
$\delta \in [0, 1]$.

Our main result can be summarized as follows: we give general upper
bounds on the performance of our proposed algorithm. These bounds
improve on the approximation guarantee of the greedy algorithm when average
element frequency is less than the constant e. Furthermore, the best choice of
control parameter $\delta$ depends on this frequency: it corresponds to the choice
of a "biased" algorithm below critical value e, and to the greedy algorithm
above it.

The paper is structured as follows: in Section 2 we review basic notions 
and define the algorithm BiasedGreedy. The main result is presented and 
further discussed in Section 3. Its proof is given in Section 4. Next we 
present several applications of our main result to the Minimum Entropy 
Graph Coloring problem. 
2 Preliminaries



In this paper we need the definition of Shannon entropy and its associated
divergence of two distributions $P = (p_i)$ and $Q = (q_i)$:

\[ D(P \,\|\, Q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}. \]

We recall that $D(P \,\|\, Q) \ge 0$ for all $P$ and $Q$.

We are concerned with the following problem: 

Definition 1. [MINIMUM ENTROPY SET COVER (MESC)]: Let
$U = \{u_1, u_2, \ldots, u_n\}$ be an $n$-element ground set, for some $n \ge 1$, and let
$\mathcal{P} = \{P_1, P_2, \ldots, P_k\}$ be a family of subsets of $U$ which cover $U$. A cover is
a function $g : U \to [k]$ such that for every $1 \le i \le n$,

\[ u_i \in P_{g(u_i)} \qquad \text{("$u_i$ is covered by set $P_{g(u_i)}$")}. \]

The entropy of cover $g$ is defined by:

\[ Ent(g) = -\sum_{i=1}^{k} \frac{|g^{-1}(i)|}{n} \log_2 \frac{|g^{-1}(i)|}{n}, \]

with the convention $0 \log_2 0 = 0$.

[OBJECTIVE:] Find a cover $g$ of minimum entropy.

Consider an instance $(U, \mathcal{P})$ as above. Define

\[ f = \frac{1}{n} \sum_{i=1}^{k} |P_i|, \]

the average frequency of a random element in $U$.
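To make the two quantities above concrete, the following minimal Python sketch computes the average frequency and the entropy of a given cover. The function names `average_frequency` and `cover_entropy`, as well as the toy instance, are illustrative assumptions and do not appear in the paper; a cover is represented as a dictionary from elements to set indices.

```python
import math
from collections import Counter

def average_frequency(universe, sets):
    """Average frequency f: total size of all sets divided by |U|."""
    return sum(len(s) for s in sets) / len(universe)

def cover_entropy(cover):
    """Entropy of a cover given as a dict: element -> index of its covering set."""
    n = len(cover)
    class_sizes = Counter(cover.values())  # number of elements assigned to each set
    return -sum(c / n * math.log2(c / n) for c in class_sizes.values())

# Hypothetical toy instance (not taken from the paper).
universe = ["a", "b", "c", "d"]
sets = [{"a", "b", "c"}, {"c", "d"}, {"d"}]
cover = {"a": 0, "b": 0, "c": 0, "d": 1}

f = average_frequency(universe, sets)  # (3 + 2 + 1) / 4 = 1.5
h = cover_entropy(cover)               # entropy of the class sizes (3, 1)
print(f, h)
```

Sets that receive no element simply do not appear in the counter, which matches the usual convention that empty classes contribute nothing to the entropy.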

In the algorithm below we divide the elements of the ground set into Light
and Heavy elements, based on their frequency of occurrence. Parameter $\delta$
controls this division: the least frequent $\delta n$ elements are deemed Light, while
the rest are considered Heavy.

Informally, the algorithm first covers Light elements in a biased manner,
simultaneously covering each such element by a set of maximum cardinality
containing it. Once this phase is complete all Light elements are deleted
from all sets. The Heavy elements are handled in an incremental manner via
a Greedy approach. The algorithm is formally presented in the following:
INPUT: An instance $(U, \mathcal{P})$ of MESC

$L$ := the $\delta n$ least frequent elements of $U$;  $H := U \setminus L$;
$\mathcal{P}^H := \{P_1^H, P_2^H, \ldots, P_k^H\}$ where $P_i^H = P_i \setminus L$ for all $i \in [k]$

While (there exists $e \in L$)
    choose $i_e \in [k]$ to maximize $|P_{i_e}|$ where $P_{i_e} \ni e$;
    let $g(e) = i_e$;
    $L := L \setminus \{e\}$;

While (there exists $e \in H$)
    choose $i_e \in [k]$ to maximize $|P^H_{i_e}|$ where $P^H_{i_e} \ni e$;
    let $g(e) = i_e$;
    erase $e$ from all $P_i^H$;
    $H := H \setminus \{e\}$;

OUTPUT: the cover $g$.

Figure 2.1: BiasedGreedy($\delta$)
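The two phases of Figure 2.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the name `biased_greedy`, the rounding of $\delta n$ by `math.ceil`, the tie-breaking rules, and the toy instance are all our own choices, and the Heavy phase is implemented as the standard greedy rule (repeatedly take the largest residual set and assign all its remaining elements to it), which is one reading of the second loop.

```python
import math
from collections import Counter

def biased_greedy(universe, sets, delta):
    """Return a cover (dict: element -> set index) for a list `universe`
    and a list `sets` of Python sets whose union is the universe."""
    n = len(universe)
    freq = {u: sum(1 for s in sets if u in s) for u in universe}
    # Assumption: Light = the ceil(delta * n) least frequent elements,
    # ties broken by position in `universe` (sorted() is stable).
    order = sorted(universe, key=lambda u: freq[u])
    n_light = math.ceil(delta * n)
    light, heavy = set(order[:n_light]), set(order[n_light:])

    cover = {}
    # Biased phase: each Light element takes a largest set containing it.
    for e in light:
        candidates = [i for i, s in enumerate(sets) if e in s]
        cover[e] = max(candidates, key=lambda i: len(sets[i]))

    # Greedy phase on the residual sets P_i^H = P_i \ Light.
    residual = [set(s) - light for s in sets]
    while heavy:
        i = max(range(len(sets)), key=lambda j: len(residual[j]))
        chosen = set(residual[i])
        if not chosen:
            raise ValueError("the sets do not cover the universe")
        for e in chosen:
            cover[e] = i
        for r in residual:
            r -= chosen
        heavy -= chosen
    return cover

def cover_entropy(cover):
    n = len(cover)
    sizes = Counter(cover.values())
    return -sum(c / n * math.log2(c / n) for c in sizes.values())

# Hypothetical toy instance (not taken from the paper).
universe = [0, 1, 2, 3, 4, 5]
sets = [{0, 1, 2, 3}, {3, 4}, {4, 5}, {0, 5}]
covers = {d: biased_greedy(universe, sets, d) for d in (0.0, 0.5, 1.0)}
for d, g in covers.items():
    print(d, g, cover_entropy(g))
```

Setting `delta = 0.0` runs only the Greedy phase and `delta = 1.0` only the Biased phase, mirroring the two extreme algorithms discussed in the paper.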

3 Main result 

Our main result shows that the following upper bound on the performance 
of algorithm BiasedGreedy holds: 

Theorem 1. Algorithm BiasedGreedy($\delta$) produces a cover $BG : U \to [k]$
satisfying:

\[ Ent(BG) \le Ent(OPT) - (1-\delta) \log_2\left(\frac{1-\delta}{e}\right) + \log_2 f + o(1). \tag{3.1} \]



Corollary 1. The Biased algorithm, defined as the BiasedGreedy algorithm
with $\delta = 1$, produces a cover $BI$ whose entropy satisfies

\[ Ent(BI) \le Ent(OPT) + \log_2 f. \tag{3.2} \]

Observation 1. Optimizing over constant $\delta$ in inequality (3.1) reveals an
interesting fact: the optimal choice of $\delta$ is always $\delta \in \{0, 1\}$, i.e. the pure
Biased or Greedy algorithms. More precisely

• choice $\delta = 1$ (i.e. Biased) is optimal for $f < e$.

• when $f > e$ choice $\delta = 0$ (i.e. Greedy) becomes best.

Thus the optimal choice for $\delta$ has a phase transition from $\delta = 1$ to $\delta = 0$
around average density $f = e$.
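The phase transition can be illustrated numerically by comparing the additive guarantee $\log_2 f$ of the Biased algorithm (Corollary 1) with the additive guarantee $\log_2 e$ known for Greedy [Cardinal et al.(2008a)]. The sketch below is only an illustration; the helper names `biased_gap`, `GREEDY_GAP` and `recommended_algorithm` are hypothetical.

```python
import math

# Additive guarantee of Greedy over the optimum, from Cardinal et al. (2008a).
GREEDY_GAP = math.log2(math.e)

def biased_gap(f):
    """Additive guarantee of Biased over the optimum, from inequality (3.2)."""
    return math.log2(f)

def recommended_algorithm(f):
    """Pick the algorithm with the smaller additive guarantee."""
    return "Biased" if biased_gap(f) < GREEDY_GAP else "Greedy"

for f in (1.0, 1.5, 2.0, 2.5, 3.0, 4.0):
    print(f, round(biased_gap(f), 4), round(GREEDY_GAP, 4), recommended_algorithm(f))
```

The recommendation switches from Biased to Greedy exactly when $f$ crosses $e \approx 2.718$, since $\log_2$ is increasing.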



4 Proof of the main result 

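Proof. Let $BG$ be the cover generated by the BiasedGreedy algorithm, and
denote by $p_i^b = \frac{|BG^{-1}(i)|}{n}$ the associated probability distribution.

If $OPT$ is the optimal solution of the same instance, denote $x_i = |OPT^{-1}(i)|$
and $y_i = |OPT^{-1}(i) \cap Heavy|$ for all $1 \le i \le k$. By choice of $\delta$,
$\sum_{i=1}^{k} y_i = n - |Light| \le (1-\delta)n$ while $\sum_{i=1}^{k} x_i = n$.

We rewrite the entropy of $BG$ as follows:

\[ Ent(BG) = -\sum_{i=1}^{k} p_i^b \log_2 p_i^b = -\sum_{i=1}^{k} p_i^b \log_2\left( |P_i| \cdot \frac{p_i^b}{|P_i|} \right). \]

Denoting by $\Pi = (\pi_j)$ the distribution $\pi_j = \frac{|P_j|}{\sum_{i=1}^{k} |P_i|}$ we obtain:

\[ \begin{aligned}
Ent(BG) &= -\sum_{i=1}^{k} p_i^b \log_2 |P_i| - \sum_{i=1}^{k} p_i^b \log_2 \frac{p_i^b}{\pi_i} + \log_2 \sum_{i=1}^{k} |P_i| \\
&= -\sum_{i=1}^{k} p_i^b \log_2 |P_i| - D(BG \,\|\, \Pi) + \log_2 \sum_{i=1}^{k} |P_i|.
\end{aligned} \tag{4.1} \]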
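Considering now just the first sum we obtain

\[ \begin{aligned}
-\sum_{i=1}^{k} p_i^b \log_2 |P_i| &= -\sum_{i=1}^{k} \frac{|BG^{-1}(i)|}{n} \log_2 |P_i| = -\frac{1}{n} \sum_{i=1}^{k} \sum_{v \in BG^{-1}(i)} \log_2 |P_i| \\
&= -\frac{1}{n} \sum_{v \in U} \log_2 |P_{BG(v)}| = -\frac{1}{n} \sum_{v \in U} \log_2 a_v,
\end{aligned} \]

where $a_v = |P_{BG(v)}|$ is the size of the set assigned by BiasedGreedy to cover $v$.

Continuing, we infer

\[ \begin{aligned}
-\sum_{i=1}^{k} p_i^b \log_2 |P_i| &= -\frac{1}{n} \sum_{i=1}^{k} \sum_{v \in OPT^{-1}(i)} \log_2 a_v = -\frac{1}{n} \sum_{i=1}^{k} \log_2 \prod_{v \in OPT^{-1}(i)} a_v \\
&= -\frac{1}{n} \sum_{i=1}^{k} \log_2 \left[ \prod_{v \in OPT^{-1}(i) \cap Light} a_v \cdot \prod_{v \in OPT^{-1}(i) \cap Heavy} a_v \right].
\end{aligned} \]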



From the definition of the algorithm we conclude the following:

• for all $v \in OPT^{-1}(i) \cap Light$,

\[ a_v = |P_{BG(v)}| = \max_{j \,:\, v \in P_j} |P_j| \ge |P_{OPT(v)}| \ge |OPT^{-1}(i)| = x_i. \]

• On the other hand, for $v \in OPT^{-1}(i) \cap Heavy$ we analyze the Greedy
phase of BiasedGreedy algorithm in a manner completely similar to the
analysis of the Greedy algorithm in [Cardinal et al.(2008a)] and infer
that

\[ \prod_{v \in OPT^{-1}(i) \cap Heavy} a_v \ge y_i!. \]

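Therefore,

\[ \begin{aligned}
-\sum_{i=1}^{k} p_i^b \log_2 |P_i| &\le -\frac{1}{n} \sum_{i=1}^{k} \log_2 \left( x_i^{x_i - y_i} \cdot y_i! \right) \\
&= -\frac{1}{n} \sum_{i=1}^{k} (x_i - y_i) \log_2 x_i - \frac{1}{n} \sum_{i=1}^{k} \log_2 (y_i!) \\
&= -\sum_{i=1}^{k} \frac{x_i}{n} \log_2 \frac{x_i}{n} - \log_2 n + \frac{1}{n} \sum_{i=1}^{k} y_i \log_2 x_i - \frac{1}{n} \sum_{i=1}^{k} \log_2 (y_i!).
\end{aligned} \]

Applying now the inequality $y! \ge (y/e)^y$ we obtain:

\[ \begin{aligned}
-\sum_{i=1}^{k} p_i^b \log_2 |P_i| &\le Ent(OPT) - \log_2 n + \frac{1}{n} \sum_{i=1}^{k} y_i \log_2 x_i - \frac{1}{n} \sum_{i=1}^{k} \log_2 \left( \frac{y_i}{e} \right)^{y_i} \\
&= Ent(OPT) - \log_2 n + \frac{1}{n} \sum_{i=1}^{k} y_i \log_2 x_i - \frac{1}{n} \sum_{i=1}^{k} y_i \log_2 y_i + \frac{\sum_{i=1}^{k} y_i}{n} \log_2 e \\
&\le Ent(OPT) - \log_2 n - \frac{1}{n} \sum_{i=1}^{k} y_i \log_2 \frac{y_i}{x_i} + (1 - \delta) \log_2 e.
\end{aligned} \tag{4.2} \]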

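Considering now distributions $\bar{x} = \left( \frac{x_i}{n} \right)$, $\bar{y} = \left( \frac{y_i}{n - |Light|} \right)$ we obtain

\[ \begin{aligned}
-\frac{1}{n} \sum_{i=1}^{k} y_i \log_2 \frac{y_i}{x_i} &= -\frac{n - |Light|}{n} \sum_{i=1}^{k} \bar{y}_i \log_2 \frac{\bar{y}_i \, (n - |Light|)}{\bar{x}_i \, n} \\
&= -\frac{n - |Light|}{n} \sum_{i=1}^{k} \bar{y}_i \log_2 \frac{\bar{y}_i}{\bar{x}_i} - \frac{n - |Light|}{n} \log_2 \frac{n - |Light|}{n} \\
&= -\frac{n - |Light|}{n} D(\bar{y} \,\|\, \bar{x}) - \frac{n - |Light|}{n} \log_2 \frac{n - |Light|}{n}.
\end{aligned} \]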

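Putting all things together, and using the nonnegativity of the two divergences
together with $\sum_{i=1}^{k} |P_i| = nf$, we obtain

\[ Ent(BG) \le Ent(OPT) - (1-\delta) \log_2 (1-\delta) + (1-\delta) \log_2 e + \log_2 f + o(1), \]

and the proof is complete. □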
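The decomposition (4.1) used at the start of the proof is an exact identity: the entropy of a cover equals the weighted log-size term, minus the divergence to the distribution proportional to the set sizes $|P_i|$, plus the logarithm of the total size. The short Python sketch below checks it numerically; the function name `decomposition_41` and the input sizes are hypothetical, chosen only for illustration.

```python
import math

def decomposition_41(set_sizes, class_sizes):
    """Return (entropy, right-hand side of (4.1)) for a cover.

    set_sizes[i]   = |P_i|, the size of the i-th set;
    class_sizes[i] = number of elements the cover assigns to the i-th set.
    """
    n = sum(class_sizes)
    total = sum(set_sizes)
    p = [c / n for c in class_sizes]        # distribution induced by the cover
    pi = [s / total for s in set_sizes]     # distribution proportional to |P_i|
    entropy = -sum(x * math.log2(x) for x in p if x > 0)
    first_sum = -sum(x * math.log2(s) for x, s in zip(p, set_sizes) if x > 0)
    divergence = sum(x * math.log2(x / q) for x, q in zip(p, pi) if x > 0)
    return entropy, first_sum - divergence + math.log2(total), divergence

# Hypothetical sizes: four sets of sizes 4, 2, 2, 2 and a cover with classes 4, 1, 1, 0.
lhs, rhs, div = decomposition_41([4, 2, 2, 2], [4, 1, 1, 0])
print(lhs, rhs, div)
```

Because the divergence term is nonnegative, dropping it yields the upper bound on the entropy that the rest of the proof refines.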

5 Application to minimum entropy graph coloring



Just as is the case with the Greedy algorithm [Cardinal et al.(2012)], our
result has implications for the minimum entropy coloring problem. This
problem can be recast as an implicit set cover problem [Karp(2011)], where
the sets are the maximal independent sets in $G$. Given the intractability of
the maximum independent set problem, we can only efficiently implement the
Biased algorithm on special classes of graphs, where this problem is easier.
On the other hand algorithm Biased has some nice properties, similar to
those discussed in [Cardinal et al.(2012)] for the Greedy algorithm:
• it can be implemented in polynomial time on perfect graphs. Indeed,
the largest independent set containing a given vertex can easily be
computed in a perfect graph.

• it allows the use of $\eta$-approximately optimal independent sets (for some
constant $\eta \ge 1$) instead of optimal ones at the expense of introducing
an extra additive term of $\log_2 \eta$ in the upper bound of equation (3.1). This
follows easily by simply redoing the proof of the main Theorem in this
setting. We can apply this observation to get a slight improvement of
Theorem 8 from [Cardinal et al.(2012)] when $f < e$:



Corollary 2. Algorithm Biased produces a coloring of a graph $G =
(V, E)$ with maximum degree $\Delta$ satisfying

\[ Ent(Biased) \le Ent(OPT) + \log_2(\Delta + 2) + \log_2(f/3). \]



The proof of the corollary directly parallels that of Theorem 8 from
[Cardinal et al.(2012)].

Applying Theorem 1 to graph coloring problems is rather inconvenient as
parameter $f$ involves maximal independent sets and is not easy to compute.
The situation is slightly better for graphs with independence number $\alpha(G) \le 3$.
In this case maximal independent sets correspond either to triangles,
edges, or isolated vertices in the complement graph $\overline{G}$. Parameter $f$ also has
an easier interpretation: Let $I$ be the number of isolated vertices in $\overline{G}$. Let
$T$ be the number of distinct triangles in $\overline{G}$. Finally, let $M$ be the number of
edges of $\overline{G}$ that are not contained in any triangle. Then

\[ f = \frac{I + 2M + 3T}{n}. \]
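The formula for the average frequency in terms of isolated vertices, triangle-free edges and triangles of the complement graph can be evaluated directly. The Python sketch below is a minimal illustration: the function name `frequency_from_complement` and the example graph are hypothetical, the complement is given as a simple edge list on vertices `0..n-1`, and the code assumes (without checking) that the complement contains no clique on four vertices, i.e. that the independence number of the original graph is at most 3.

```python
from itertools import combinations

def frequency_from_complement(n, edges):
    """Compute (I + 2M + 3T) / n for the complement graph given by `edges`."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # I: vertices with no neighbour in the complement graph.
    isolated = sum(1 for v in adj if not adj[v])
    # T: unordered vertex triples that are pairwise adjacent.
    triangles = sum(
        1
        for a, b, c in combinations(range(n), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # M: edges whose endpoints have no common neighbour (in no triangle).
    lonely_edges = sum(1 for u, v in edges if not (adj[u] & adj[v]))
    return (isolated + 2 * lonely_edges + 3 * triangles) / n

# Hypothetical complement graph: triangle {0,1,2}, pendant edge (2,3), isolated vertex 4.
f = frequency_from_complement(5, [(0, 1), (1, 2), (0, 2), (2, 3)])
print(f)  # (1 + 2*1 + 3*1) / 5 = 1.2
```

The triple enumeration is cubic in the number of vertices, which is adequate for small illustrative graphs.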

Furthermore, in this case the algorithm Biased has a very natural
interpretation: we create a tentative color $c_W$ for any maximal independent set
(triangle, edge or isolated vertex) $W$ and add color $c_W$ to a list coloring of all
vertices in $W$. Then for each vertex we select a random color from its list.

The algorithm Biased can be improved in practice by employing a number 
of heuristics such as: 

• Attempt to color all elements of a largest independent set with the 
same color. 
• Collapse two colors into one if legal.



These heuristics can only decrease the entropy of the resulting coloring. 

There are instances (e.g. edge orientations of a cycle from [Cardinal et al.(2008b)])
where Biased outperforms Greedy. But even when it does not, our analysis
may provide better theoretical guarantees than those available for Greedy.

Example 1. Consider graph $G = (V, E)$ from Figure 5.1 (a) (its complement
$\overline{G}$ is displayed in Figure 5.1 (b)).




Figure 5.1: (a) graph $G$ (b) its complement $\overline{G}$




Figure 5.2: Two colorings $C_1$ and $C_2$ of graph $G$. For convenience the
complement graph $\overline{G}$ is pictured, rather than $G$. $C_1$ is an optimal solution.

Graph $G$ provides an easy instance where Greedy and Biased (may) produce
different colorings. Indeed, node 4 is colored by Biased with a color
corresponding to a triangle, whereas 5 takes a color corresponding to an edge, so
nodes 4 and 5 must assume different colors in a Biased coloring, whereas they
may have the same color in a Greedy coloring. With the optimizations
described above both Greedy and Biased produce one of the two colorings $C_1$, $C_2$
from Figure 5.2, with color classes of cardinalities (3; 3; 2; 0) and (3; 2; 2; 1),
respectively. The first one corresponds to the optimal solution. On the other
hand for this graph the average element frequency is
$f = \frac{3 \times 3 + 1 \times 2}{8} = \frac{11}{8} < e$, so
the upper bound on the entropy of coloring $C_2$ given by Corollary 1
(inequality (3.2)) is tighter
than the one provided by the Greedy algorithm in [Cardinal et al.(2012)].
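The numbers in Example 1 can be checked with a few lines of Python. The sketch below computes the entropies of the two colorings from their color class cardinalities and compares their difference with the two additive guarantees. The function name `entropy_of_classes` is hypothetical, and the value $f = 11/8$ is an assumption: it is our reading of the frequency computed in the example (three triangles and one edge on the 8 vertices counted by the color classes).

```python
import math

def entropy_of_classes(sizes):
    """Entropy of a coloring given the cardinalities of its color classes."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

ent_c1 = entropy_of_classes([3, 3, 2, 0])  # coloring C_1 (optimal in the example)
ent_c2 = entropy_of_classes([3, 2, 2, 1])  # coloring C_2
gap = ent_c2 - ent_c1                      # how far C_2 is from the optimum

f = 11 / 8                                 # assumed average frequency, see lead-in
biased_bound = math.log2(f)                # additive guarantee of inequality (3.2)
greedy_bound = math.log2(math.e)           # additive guarantee known for Greedy

print(ent_c1, ent_c2, gap, biased_bound, greedy_bound)
```

Under these assumptions the actual gap of $C_2$ lies below the Biased guarantee, which in turn lies below the Greedy guarantee.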